Using R to Build a Community Data Explorer for Cincinnati (CoDEC)

CCHMC R Users Group

Cole Brokamp, Erika Manning, Andrew Vancil

5/10/23

Welcome

Join the RUG Outlook group for updates and events.

Upcoming BUG + RUG Event

Using R to Build a Community Data Explorer for Cincinnati (CoDEC)

  1. Introduction to CoDEC
  2. Sharing CoDEC Data
  3. Exploring CoDEC Data

Background

Equitable Data

The White House’s Equitable Data Working Group has defined equitable data as “those that allow for rigorous assessment of the extent to which government programs and policies yield consistently fair, just, and impartial treatment of all individuals.” They advise that equitable data should “illuminate opportunities for targeted actions that will result in demonstrably improved outcomes for underserved communities.” The group recommended making disaggregated data the norm while being “… intentional about when data are collected and shared, as well as how data are protected so as not to exacerbate the vulnerability of members of underserved communities, many of whom face the heightened risk of harm if their privacy is not protected.”

The U.S. Chief Data Scientist, Denice Ross, has declared that “open data is necessary and not sufficient to drive the type of action that we need to create a more equitable society.” Open data can fall short of driving action if it is not equitable. Disaggregating data by sensitive attributes, like race and ethnicity, can elucidate inequities that would otherwise remain hidden.

Community-Level & Disaggregated Data

Data are people, and when sharing data, privacy is a spectrum of tradeoffs between risks and benefits to individuals and populations. Data collected at the individual level by one organization often cannot be shared with another organization because of legal restrictions or organization-specific data governance policies. We are often interested in community-level (e.g., neighborhood, census tract, ZIP code) data disaggregated by gender, race, or other sensitive attributes. Harmonizing data upstream of storage allows organizations to contribute disaggregated, community-level data without disclosing individual-level data when sharing across organizations.

CoDEC

Community Data Explorer for Cincinnati (CoDEC) is a data repository composed of equitable, community-level data for Cincinnati.

  • Data about communities come in different spatiotemporal resolutions and extents and are not designed with the specific goal of integrating with other data.
  • CoDEC defines specifications for community-level data in an effort to make them more FAIR.
  • A common data specification means that organizations can more easily use methods and tools for harmonizing, storing, accessing, and sharing community-level data.
  • {codec} R package to describe, curate, and check against CoDEC specifications

Using these tools, a collection of extant community-level data resources is automatically transformed into a harmonized, community-level tabular data package that is openly available and accompanied by:

  1. a richly-documented data catalog
  2. a web-based interface for exploring and learning from data
  3. an API for accessing data at scale and on demand

CoDEC Overview

%%{init: { "fontFamily": "arial" } }%%

flowchart LR

classDef I fill:#E49865,stroke:#333,stroke-width:0px;
classDef II fill:#EACEC5,stroke:#333,stroke-width:0px;
classDef III fill:#CBD6D5,stroke:#333,stroke-width:0px;
classDef IIII fill:#8CB4C3,stroke:#333,stroke-width:0px;
classDef V fill:#396175,color:#F6EAD8,stroke:#333,stroke-width:0px;

subgraph source-box [data sources]
    org(community \norganization):::I
    jfs(government \n organization):::I
    cchmc("healthcare \n organization"):::I
    acs("built, natural, and \n social environment"):::I
end
class source-box II

stage(collection of community-\nlevel data):::I

org --> |"data \n support"| stage
jfs --> |decentralized \n geocoding| stage
cchmc --> |spatiotemporal \n aggregation| stage
acs --> |automatic \n interpolation| stage
stage --> codec-box

subgraph codec-box ["Community Data Explorer for Cincinnati (CoDEC)"]
    ingest("(meta)data harmonization"):::IIII
    data(community-level \n tabular data resource):::IIII
    data-catalog("interactive data catalog\n geomarker.io/codec"):::IIII
    ingest --> data
    data --> data-catalog
    data --> api(data API):::IIII
    api --> bindings(R code \n for accessing data):::IIII
    data-catalog --> download(explore, map, download):::V
end

class codec-box III

bindings --> dashboard("dashboards and reports"):::V
bindings --> qr(QI & research):::V
api ---> anywhere(public access):::V

Data Harmonization

  • CoDEC encodes data streams about the communities in which we live into a common format (census tract and month) so that they can be decoded into different community-level geographies and different time frames.
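
As a rough illustration of decoding the common tract-and-month format into a coarser time frame, the base R sketch below sums monthly counts to tract-by-year totals. The data frame, its column names (`census_tract_id`, `year`, `month`, `n_events`), and all values are made up for illustration; this is not the CoDEC implementation.

```r
# Hypothetical tract-by-month counts; identifiers and values are invented.
d <- data.frame(
  census_tract_id = rep(c("39061000100", "39061000200"), each = 4),
  year = rep(c(2021, 2021, 2022, 2022), times = 2),
  month = rep(c(1, 2, 1, 2), times = 2),
  n_events = c(3, 5, 2, 4, 10, 0, 7, 1)
)

# Decode to a coarser time frame: sum the monthly counts within tract and year.
d_year <- aggregate(n_events ~ census_tract_id + year, data = d, FUN = sum)

# Sort for readability.
d_year <- d_year[order(d_year$census_tract_id, d_year$year), ]
print(d_year)
```

Summing is only appropriate for counts; a rate or a median would need a different rule (for example, a weighted mean), so the aggregation function is a per-variable choice.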

FAIR

  • 🔎 findable: use a unique and persistent identifier, add rich metadata (using existing standards)
  • 🔓 accessible: store in a data repository ( ⚠️ personal/classified information, but metadata still accessible)
  • ⚙️ interoperable: use an open file format with controlled vocabularies, reference relevant datasets
  • ♻️ reusable: good documentation including a README (with project background, organization of files, and how to reproduce the project) and a data dictionary (with variable explanations, measurement units, how missingness is encoded, etc.); usage licenses (for code or data/presentations/papers)

TRUST

  • 🤲 transparent: make specific repository services and data holdings verifiable by publicly accessible evidence

  • 📃 responsible: ensure authenticity and integrity of data holdings

  • 👥 user-focused: meet data management norms and expectations of target user communities

  • ⏳ sustainable: preserve services and data holdings for the long term

  • ⚙️ technological: provide infrastructure and capabilities supporting secure, persistent, and reliable services

Creating and maintaining an open community-level data resource equips the entire community for data-powered decision making and boosts organizational trustworthiness. Demonstrating the reliability and capability to manage shared data appropriately helps earn the trust of the organizations and communities the resource is intended to serve.

CoDEC Data Available Now

https://geomarker.io/codec/articles/data.html

How to Read Data in R Using CoDEC

codec::codec_data("hamilton_property_code_enforcement")

How to Read Metadata in R Using CoDEC

codec::codec_data("hamilton_property_code_enforcement") |>
  codec::glimpse_tdr()

Sharing CoDEC Data

Frictionless Standards

CoDEC-Specific Specifications

https://geomarker.io/codec/articles/specs.html

%%{init: { "fontFamily": "Arial" } }%%

flowchart TB

classDef I fill:#E49865,stroke:#333,stroke-width:2px;
classDef II fill:#EACEC5,stroke:#333,stroke-width:2px;
classDef III fill:#CBD6D5,stroke:#333,stroke-width:2px;
classDef IIII fill:#8CB4C3,stroke:#333,stroke-width:2px;

tdr([tabular-data-resource]):::I

name(name):::II
path(path):::II
version(version):::II   
schema([schema]):::II
title(title):::II
homepage(homepage):::II
description(description):::II

tdr --- name
tdr --- path
tdr --- version   
tdr --- title
tdr --- description
tdr --- homepage
tdr --- schema

schema --- fields([fields]):::III
schema --- primaryKey(primaryKey):::III
schema --- foreignKey(foreignKey):::III

fields --- field_name_1(field_1:\nname \n title \n description \n type):::IIII
fields --- field_name_2(field_2:\nname \n title \n type \n constraints):::IIII
fields --- field_name_3(field_3:\nname \n title \n description \n type \n constraints):::IIII
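
To make the layout in the diagram concrete, here is a minimal sketch of a tabular-data-resource descriptor written as a nested R list. The property names (`name`, `path`, `version`, `title`, `homepage`, `description`, `schema`, `fields`, `primaryKey`) follow the diagram; every value, and the choice to key fields by name for easy lookup, is an assumption made for illustration.

```r
# A made-up descriptor mirroring the diagram's structure; all values are hypothetical.
tdr <- list(
  name = "example_resource",
  path = "example_resource.csv",
  version = "0.1.0",
  title = "Example Resource",
  homepage = "https://example.org",
  description = "A made-up resource used only to show the descriptor layout.",
  schema = list(
    fields = list(
      census_tract_id = list(
        name = "census_tract_id",
        title = "Census Tract Identifier",
        type = "string"
      ),
      year = list(name = "year", title = "Year", type = "integer"),
      n_events = list(
        name = "n_events",
        title = "Number of Events",
        description = "Count of events in the tract and year.",
        type = "integer",
        constraints = list(minimum = 0)
      )
    ),
    primaryKey = c("census_tract_id", "year")
  )
)

# Pull one property out of every field definition.
field_types <- vapply(tdr$schema$fields, function(f) f$type, character(1))
print(field_types)

# Show the top two levels of the descriptor.
str(tdr, max.level = 2)
```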

{cincy}

  • CoDEC relies on the {cincy} R package to define Cincinnati-area geographies and interpolate area-level data between census tracts, neighborhoods, and ZIP codes in different years.
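
The idea behind interpolating area-level data can be sketched in base R without {cincy}: each tract's count is split across neighborhoods according to a weight, then summed within each neighborhood. The crosswalk table, its weights, and all identifiers below are hypothetical; this illustrates weighted interpolation in general and is not the {cincy} API.

```r
# Hypothetical tract-level counts.
tract_values <- data.frame(
  census_tract_id = c("A", "B", "C"),
  n_events = c(100, 50, 20)
)

# Hypothetical crosswalk: the share of each tract (e.g., by population)
# that falls within each neighborhood. Weights sum to 1 within a tract.
crosswalk <- data.frame(
  census_tract_id = c("A", "B", "B", "C"),
  neighborhood = c("North", "North", "South", "South"),
  weight = c(1, 0.4, 0.6, 1)
)

# Split each tract's count by weight, then sum within neighborhood.
x <- merge(tract_values, crosswalk, by = "census_tract_id")
x$weighted <- x$n_events * x$weight
neighborhood_values <- aggregate(weighted ~ neighborhood, data = x, FUN = sum)
names(neighborhood_values)[2] <- "n_events"
print(neighborhood_values)
```

Because the weights sum to 1 within each tract, the total count is preserved across geographies. This holds for counts only; averages or percentages would need a weighted mean instead of a weighted sum.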

{codec}

The goal of the R package {codec} is to support CoDEC data infrastructure through:

  • curating metadata for tabular data in R: vignette("curating-metadata")
  • reading and writing tabular-data-resources: vignette("reading-writing-tdr")
  • defining the CoDEC tabular-data-resource specifications: vignette("specs")
  • providing tools to check CoDEC tabular-data-resources and create an interactive data catalog: vignette("data")

Curating metadata for tabular data in R using attributes

https://geomarker.io/codec/articles/curating-metadata.html
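
As a minimal sketch of the underlying idea, base R attributes can carry metadata on both a data frame and its columns. The example uses only base R `attr()`; the attribute names and values are assumptions for illustration, and the {codec} helpers described in the vignette are not shown here.

```r
# A small made-up table.
d <- data.frame(
  census_tract_id = c("39061000100", "39061000200"),
  n_events = c(8L, 10L)
)

# Table-level metadata stored as attributes of the data frame.
attr(d, "name") <- "example_resource"
attr(d, "title") <- "Example Resource"

# Column-level metadata stored as attributes of a single column.
attr(d$n_events, "title") <- "Number of Events"
attr(d$n_events, "description") <- "Count of events in the tract."

# Read the metadata back; exact = TRUE avoids partial matching with "names".
print(attr(d, "name", exact = TRUE))
print(attributes(d$n_events))
```

Some operations, such as subsetting or combining values, can drop custom attributes, so metadata stored this way is worth re-checking after transforming the data.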

Reading and writing tabular data resources

https://geomarker.io/codec/articles/reading-writing-tdr.html

Tools for Checking Against CoDEC Specifications

https://geomarker.io/codec/reference/check_codec_tdr_csv.html
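
To show what checking a table against a specification can look like, the sketch below defines a small validator in base R. The function `check_example_tdr()` and its three rules are hypothetical stand-ins chosen for illustration; they are not the actual CoDEC specifications and not the behavior of `check_codec_tdr_csv()`.

```r
# Hypothetical checker: returns a character vector of problems (empty if none).
check_example_tdr <- function(d) {
  problems <- character(0)

  # Assumed rule: the table needs a "name" attribute made of
  # lowercase letters, digits, or underscores.
  name <- attr(d, "name", exact = TRUE)
  if (is.null(name) || !grepl("^[a-z0-9_]+$", name)) {
    problems <- c(problems, "name must use only lowercase letters, digits, or underscores")
  }

  # Assumed rule: the table needs a census tract identifier and a year column.
  for (col in c("census_tract_id", "year")) {
    if (!col %in% names(d)) {
      problems <- c(problems, paste("missing required column:", col))
    }
  }

  problems
}

# A table that satisfies the assumed rules.
good <- data.frame(census_tract_id = "39061000100", year = 2021L, n_events = 8L)
attr(good, "name") <- "example_resource"

# A table that violates all three assumed rules.
bad <- data.frame(tract = "39061000100")
attr(bad, "name") <- "Bad Name"

print(check_example_tdr(good))
print(check_example_tdr(bad))
```

Returning all problems at once, rather than stopping at the first one, lets a data contributor fix everything in a single pass.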

Exploring CoDEC

Leveraging data standards for Shiny?

Screenshot

Shiny

Inset panel and scatterplot

bslib layout

“crosstalk” hack

Interactive Demo

Conclusions

R …

  • curate data, metadata, visualizations all in one language

Thank You

🌐 https://geomarker.io/codec

💻 github.com/geomarker-io